@dmsnell dmsnell commented Jul 23, 2025

Trac ticket: Core-63864

Status

Please feel free to ignore this for now.

Description

The existing wp_iso_descrambler() was added in 2004 because certain email subjects were appearing with funny-looking string spans. The following note was left as a comment:

this may only work with iso-8859-1, I'm afraid

Even so, it is only likely to work reliably with US-ASCII, which is rare in such a MIME-encoded string. In 2004 it might have been more common for PHP systems to operate on ISO-8859-1 (latin1) as their default, but today UTF-8 is the predominant encoding. Because the function returns the bytes exactly as they were encoded, it fails to perform its main function, which is to translate non-ASCII encodings.

Screenshot 2025-07-23 at 12 53 25 PM

The above image illustrates how the bytes print as an invalid UTF-8 sequence in trunk after decoding. The 0x80 byte was chosen for this demonstration because in latin1 it’s a control character, in cp1252 and in HTML it’s remapped to the Euro sign, and in UTF-8 it’s an invalid sequence.

Without additional conversion, calling code has to know which encoding the running PHP system uses and which other code will perform re-encoding, and it is likely to get this wrong. Worse, if the encoded word declares any charset other than ISO-8859-1 (latin1), the returned bytes are wrong no matter which character set the caller assumes:

var_dump( wp_iso_descrambler( '=?ISO-8859-2?Q?=A3=F3d=BC?=' ) );
string(4) "��d�"
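
To make the failure concrete, here is a minimal sketch that inspects the four raw bytes the Q-encoded payload stands for. It does not call `wp_iso_descrambler()`; the byte string is written out by hand from the `=A3=F3d=BC` payload, and the variable names are illustrative only.

```php
<?php
// Raw bytes that the Q-encoded payload "=A3=F3d=BC" stands for.
$raw_bytes = "\xA3\xF3d\xBC";

// Four bytes, matching the string(4) reported by var_dump() above.
$byte_count = strlen( $raw_bytes );

// preg_match() with the /u flag returns false when the subject is not valid UTF-8.
$is_valid_utf8 = 1 === preg_match( '//u', $raw_bytes );

var_dump( $byte_count );    // int(4)
var_dump( $is_valid_utf8 ); // bool(false)
```

The bytes only make sense when read as ISO-8859-2, which is information the caller no longer has once the charset label has been discarded.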

This patch implements a compliant RFC2047 MIME text decoder, and decodes the text into UTF-8. Decoding into a single encoding normalizes the output and gives calling code the freedom to change the encoding if it wants without needing to make any assumptions or inquire about what it gets.

var_dump( rfc2047_decode( '=?ISO-8859-2?Q?=A3=F3d=BC?=' ) );
string(7) "Łódź"

With the same 0x80 input as in the first screenshot, we can see that the default output is now converted from the indicated input encoding. In this example, the byte decodes to a control character encoded as UTF-8, which is authentic to the given input. The re-encoding steps shown are now wrong to apply because the returned data is already in UTF-8.

Screenshot 2025-07-23 at 12 54 19 PM
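
The double-encoding problem can be reproduced without any extension. The sketch below uses a hypothetical helper, `sketch_latin1_to_utf8()`, which is not part of the patch, to stand in for whatever legacy re-encoding step a caller might still apply after decoding.

```php
<?php
/**
 * Hypothetical helper: treat every byte as an ISO-8859-1 character and
 * encode it as UTF-8, the way a legacy fix-up step might.
 */
function sketch_latin1_to_utf8( string $bytes ): string {
	return preg_replace_callback(
		'/[\x80-\xFF]/',
		static function ( array $m ): string {
			$byte = ord( $m[0] );
			// Code points U+0080..U+00FF take two bytes in UTF-8.
			return chr( 0xC0 | ( $byte >> 6 ) ) . chr( 0x80 | ( $byte & 0x3F ) );
		},
		$bytes
	);
}

$raw   = "\x80";                        // Byte returned without any conversion.
$once  = sketch_latin1_to_utf8( $raw );  // U+0080 encoded as UTF-8.
$twice = sketch_latin1_to_utf8( $once ); // The UTF-8 bytes encoded a second time.

echo bin2hex( $once ), "\n";  // c280
echo bin2hex( $twice ), "\n"; // c382c280
```

Once the decoder already returns UTF-8, any such step in calling code produces the second, corrupted form and should be dropped.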

Supported encodings

This implementation attempts to support as many encodings as are practical based on the availability of decoding logic on the running server.

If mb_convert_encoding() is available it will be preferred, followed by iconv(), followed by direct conversion from US-ASCII or UTF-8 byte streams. Nuances and peculiarities of the PHP text-encoding functions are left as artifacts of PHP and not addressed in this function.
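
The preference order described above might be sketched as follows. This is not the code in the patch: `sketch_to_utf8()` is a hypothetical name, and the way unsupported charsets are detected here (catching `ValueError`, checking for `false`) is an assumption about one reasonable approach.

```php
<?php
/**
 * Hypothetical helper, not the patch's code: convert $bytes from $charset
 * to UTF-8, trying converters in the order described above.
 * Returns null when no available converter handles the charset.
 */
function sketch_to_utf8( string $bytes, string $charset ): ?string {
	// 1. Prefer mbstring when the extension is loaded.
	if ( function_exists( 'mb_convert_encoding' ) ) {
		try {
			$converted = @mb_convert_encoding( $bytes, 'UTF-8', $charset );
		} catch ( \ValueError $e ) {
			// PHP 8 throws for charsets mbstring does not know.
			$converted = false;
		}
		if ( is_string( $converted ) ) {
			return $converted;
		}
	}

	// 2. Fall back to iconv, which returns false on failure.
	if ( function_exists( 'iconv' ) ) {
		$converted = @iconv( $charset, 'UTF-8', $bytes );
		if ( is_string( $converted ) ) {
			return $converted;
		}
	}

	// 3. Last resort: accept input that is already US-ASCII or valid UTF-8.
	$name = strtoupper( $charset );
	if ( 'US-ASCII' === $name && 1 === preg_match( '/^[\x00-\x7F]*\z/', $bytes ) ) {
		return $bytes;
	}
	if ( 'UTF-8' === $name && 1 === preg_match( '//u', $bytes ) ) {
		return $bytes;
	}

	return null;
}
```

On a server with mbstring or iconv, `sketch_to_utf8( "\xA3\xF3d\xBC", 'ISO-8859-2' )` should yield the seven UTF-8 bytes of "Łódź". On a server with neither, only US-ASCII and UTF-8 labelled input can be passed through.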

Error handling

Unfortunately, even where iconv_mime_decode() is available, its error-handling options are limited and unclear. By implementing the decoder in user-space the error cases can be explicitly handled, and this implementation provides configurable error handling:

  • By default, invalid encoded words are preserved as unencoded plain text. This corresponds to the preserve-errors flag. The input text will appear in the output and look jumbled, but perhaps a human can make sense of the data in it. This is how most decoders handle errors.
  • Passing in replace-errors will remove the entire encoded word and replace it with the replacement character U+FFFD. This discards information from the input, but leaves a placeholder indicating that it was there before.
  • Passing in bail-on-error will cause the function to return early and return null, effectively the same as the strict mode in other decoders.
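
The effect of the three modes can be illustrated with a small sketch. The flag names come from the description above, but everything else is hypothetical: the function name, its signature, and the idea of receiving pre-split parts are assumptions made for illustration and do not reflect how the patch passes or applies the flag.

```php
<?php
/**
 * Hypothetical illustration of the three error modes.
 *
 * @param array  $parts    List of [ original text, decoded text or null on failure ].
 * @param string $on_error One of 'preserve-errors', 'replace-errors', 'bail-on-error'.
 */
function sketch_join_decoded( array $parts, string $on_error = 'preserve-errors' ): ?string {
	$output = '';

	foreach ( $parts as [ $original, $decoded ] ) {
		if ( null !== $decoded ) {
			$output .= $decoded;
			continue;
		}

		if ( 'bail-on-error' === $on_error ) {
			// Strict mode: give up on the whole input.
			return null;
		}

		if ( 'replace-errors' === $on_error ) {
			// Drop the encoded word but leave U+FFFD as a placeholder.
			$output .= "\u{FFFD}";
		} else {
			// Default: keep the undecodable encoded word as plain text.
			$output .= $original;
		}
	}

	return $output;
}

$parts = array(
	array( 'Hi ', 'Hi ' ),
	array( '=?X-BAD?Q?abc?=', null ), // Pretend this word failed to decode.
);

var_dump( sketch_join_decoded( $parts ) );
var_dump( sketch_join_decoded( $parts, 'replace-errors' ) );
var_dump( sketch_join_decoded( $parts, 'bail-on-error' ) );
```

The default keeps the most information for a human reader, while `bail-on-error` suits callers that would rather reject a header than display a partially decoded one.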

There are multiple classes of potential errors, and the RFC does not define error behavior. This implementation treats all of the classes listed below in the same way. The one exception is the rule that encoded words must be 75 characters or shorter: that rule was clearly intended for encoders, to make the job of decoding simpler, and does not otherwise speak to the well-formedness of the encoding.

  • Unsupported character sets.
  • Invalid encodings (B and Q are supported).
  • Invalid byte sequences in the quoted-printable encoding, such as =. or =6f (only upper-case hex digits are allowed).
  • Invalid base64 data in the B encoding.
  • Invalid character re-encoding on the decoded byte stream.
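
As one example of how such an error class can be detected, the sketch below decodes a Q-encoded payload and rejects the invalid byte sequences mentioned in the list. `sketch_decode_q()` is a hypothetical name, and the regular expressions are an assumption about one way to enforce the upper-case rule, not the patch's implementation.

```php
<?php
/**
 * Hypothetical Q-payload decoder: returns the raw bytes, or null when
 * an "=" is not followed by exactly two upper-case hex digits.
 */
function sketch_decode_q( string $payload ): ?string {
	// Reject sequences such as "=." or "=6f".
	if ( 1 === preg_match( '/=(?![0-9A-F]{2})/', $payload ) ) {
		return null;
	}

	// In the Q encoding an underscore stands for a space.
	$payload = str_replace( '_', ' ', $payload );

	// Turn each "=XX" escape into the byte it names.
	return preg_replace_callback(
		'/=([0-9A-F]{2})/',
		static function ( array $m ): string {
			return chr( (int) hexdec( $m[1] ) );
		},
		$payload
	);
}

var_dump( bin2hex( sketch_decode_q( '=A3=F3d=BC' ) ) ); // string(8) "a3f364bc"
var_dump( sketch_decode_q( '=6f' ) );                   // NULL
```

The returned bytes are still in the declared charset, so a failure in the later conversion to UTF-8 is a separate error class from the one checked here.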

Of note, the RFC implies no possible syntax errors. Instead, anything which appears as a syntax error indicates that the span of text which looks like an encoded word is actually just plain text and the parser will skip over it to look for the next well-formed encoded word.
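
This "skip anything that is not well-formed" behavior falls out naturally from a pattern-based scan. The following sketch uses a simplified pattern (the RFC's token grammar is stricter than `[^?\s]+`), and `sketch_find_encoded_words()` is a hypothetical name rather than a function from the patch.

```php
<?php
/**
 * Hypothetical scanner: list the well-formed encoded words in $text.
 * Anything that merely resembles an encoded word is not matched and
 * therefore stays plain text.
 */
function sketch_find_encoded_words( string $text ): array {
	// =?charset?encoding?payload?=  (simplified token rules).
	$pattern = '/=\?([^?\s]+)\?([^?\s]+)\?([^?\s]*)\?=/';
	preg_match_all( $pattern, $text, $matches, PREG_SET_ORDER );

	return array_map(
		static function ( array $m ): array {
			return array(
				'charset'  => $m[1],
				'encoding' => strtoupper( $m[2] ),
				'payload'  => $m[3],
			);
		},
		$matches
	);
}

// The second span lacks its closing "?=" and is ignored by the scan.
$words = sketch_find_encoded_words( 'Subject: =?ISO-8859-2?Q?=A3=F3d=BC?= and =?broken?Q?oops' );

var_dump( count( $words ) ); // int(1)
```

Because the encoding token is matched loosely, an unsupported encoding letter still counts as a well-formed encoded word and is left to the error handling described earlier instead of being silently treated as plain text.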

Notes

@github-actions

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

@dmsnell dmsnell force-pushed the add/mime-decoder branch 2 times, most recently from 42a1358 to eca673c Compare August 12, 2025 20:11
@dmsnell dmsnell force-pushed the add/mime-decoder branch 5 times, most recently from ae3f2bd to 4b4ef54 Compare September 16, 2025 12:43
@dmsnell dmsnell force-pushed the add/mime-decoder branch 6 times, most recently from f09a528 to 69fc308 Compare September 25, 2025 18:47
@dmsnell dmsnell force-pushed the add/mime-decoder branch 5 times, most recently from 4965e92 to 3efffd0 Compare October 6, 2025 21:17
@dmsnell dmsnell force-pushed the add/mime-decoder branch 4 times, most recently from a3fdc53 to e974fdf Compare October 9, 2025 23:39
@dmsnell dmsnell force-pushed the add/mime-decoder branch 3 times, most recently from 0c0f4e7 to 0d97c25 Compare October 21, 2025 09:22
@dmsnell dmsnell mentioned this pull request Nov 6, 2025
Questions arise around unspecified failure behaviors.

 - What if the syntax is obviously supposed to be an encoding but
   technically isn’t? For example, it’s missing a closing '?' It
   may be computationally heavy to _guess_ if something is broken
   syntax, so some failures are ambiguous if they should copy the
   input plaintext or return null.

 - What do other high-quality libraries do with errors?